Abstract

In recent years, information retrieval algorithms have 
taken center stage for extracting important data in ever 
larger datasets. During computation, these datasets are 
stored both in memory and on disk. As datasets become
larger, demand for superior storage solutions will rise.
Advances in hardware technology have led to the increasingly
widespread use of flash storage devices. Such devices
have clear benefits over traditional hard drives in
terms of latency of access, bandwidth and random access 
capabilities, particularly when reading data. Thus traditional
information retrieval algorithms, such as TF-IDF,
can greatly benefit. There are however some interesting 
trade-offs to consider when leveraging the advanced features
of such devices. Relative to reads, writing to such
devices can be expensive. This is because typical flash
devices (NAND technology) are updated in blocks. A mi- 
nor update to a given block requires the entire block to 
be erased, followed by a re-writing of the block. On the 
other hand, sequential writes can be two orders of mag- 
nitude faster than random writes. In addition, random 
writes shorten the life of the flash drive, since
each block can support only a limited number of erasures. 
TF-IDF can be implemented using a counting hash table. 
In general, hash tables are a particularly challenging case 
for the flash drive because this data structure is inherently 
dependent upon the randomness of the hash function, as 
opposed to the spatial locality of the data. This makes
it difficult to avoid the random writes incurred during the
construction of the counting hash table for TF-IDF. In this 
paper, we will study the design landscape for the develop- 
ment of a hash table for flash storage devices. We demon- 
strate how to effectively design a hash table with two re- 
lated hash functions, one of which exhibits a data place- 
ment property with respect to the other. Specifically, we 
focus on three designs based on this general philosophy 
and evaluate the trade-offs among them along the axes of 
query performance, insert and update times and I/O time 
through an implementation of the TF-IDF algorithm. 



1 Introduction 

In recent years, advances in hardware technology have 
led to the development of flash devices. These devices 
have several advantages such as faster seek times because 
of a lack of moving parts. Furthermore, they are more 
energy-efficient because of their use of non-mechanical 
techniques for data storage [3] [6]. Such drives are extremely
fast for random read operations since they do not
require the mechanical seeks necessary for disks. 

While data access is extremely fast, the writes to the 
drive can vary in speed depending upon the scenario. Se- 
quential writes are quite fast, though random writes can 
be over two orders of magnitude slower. The reason for
this is the coarse granularity at which data is erased and
updated on flash. In fact, depending on the management of
the Flash Translation Layer (FTL), random updates in a flash
drive may be significantly slower than random writes on 
disk in spite of the fact that the flash drive does not have 
any moving parts. On the other hand, random reads are 
extremely fast on the flash drive and are almost as fast as 
sequential reads. A second property of the flash drive is 
that it supports only a finite number of erase-write cycles, 
after which the blocks on the drive may wear out. 

The different trade-offs in read-write speed lead to
a number of challenges for database applications, espe- 
cially those in which there are frequent updates to the 
underlying data. As a result, there has been a consid- 
erable amount of research 13 H21 E2 E3 EU on 
database operations in flash storage devices. In particu- 
lar, index structures are a challenge for the case of the 
flash drive because of their frequent updates with individ- 
ual records. Such updates can be expensive, unless they 
can be carefully batched with the use of specialized up- 
date techniques. The idea is to minimize the block over- 
head in random writes. This approach also reduces the 
number of erase-write cycles on the flash, which increases 
its effective life. 

The hash table is a widely used data structure in mod- 
ern database systems JS). A hash table relies on a hash 
function to map keys to their associated values. In a
well-designed table, insertion and lookup each take
constant (amortized) time, independent of the number
of entries in the table. Such hash tables are commonly
used for lookup, duplicate detection, searching and index- 
ing in a wide range of domains including database index- 
ing. A counting hash table is one in which in addition to 
the value associated with a key, a (reference) count is also 
kept up to date in order to keep track of the occurrences 
of a specific key-value pair. Counting hash tables are also 
widely used. In the programming languages and software 
engineering context, such tables can be used for object 
reference counting to aid in garbage collection and mem- 
ory leak detection activities (e.g. Java JVM Ifl2ll24l ). In 
the data mining context, such tables are often used to effi- 
ciently count the number of occurrences of a given pattern 
(e.g. frequent pattern mining [30]). In the database context,
such tables are used for indexing (e.g. XML indexing
and selectivity estimation] 1 1). In the computational
linguistics and information retrieval context, such tables 
can be used to efficiently count the number of distinct 
words and the number of occurrences per word within a
corpus or document collection [37].

TF-IDF [16], Term Frequency-Inverse Document Frequency,
is a common technique used in text mining and
information retrieval [32]. TF-IDF measures the importance
of a particular word to a document given a corpus
of documents. Words that appear frequently across the corpus,
often referred to as stop-words (e.g. "the"), are given a lower
TF-IDF score than rarer words (e.g. "Macintosh"), which are
assumed to offer more information about a document's subject.
For query processing, such as in a search engine,
documents can be ranked by their relevancy using
TF-IDF; the relevancy of a document increases if a word 
contained in a query has a high TF-IDF score. TF-IDF 
can also be used for document similarity. A set of key- 
words can be defined for each document; keywords are 
defined by those words with a TF-IDF score higher than 
a set threshold. Applying a similarity measure to the
resulting TF-IDF vectors of two documents yields a
similarity score between them. Computing the
TF-IDF scores requires accumulating the occurrences of 
a term; this is an excellent application for counting hash 
tables. 
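
As a rough illustration of why counting hash tables fit TF-IDF, the sketch below accumulates term counts with Python's `collections.Counter`, which serves here as an in-memory counting hash table. The weighting `tf * log(N / df)` is one common TF-IDF variant and is an assumption made for illustration; the function name `tf_idf` and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumed variant: score = tf * log(N / df), where tf is the raw
    count of the term in the document and df is the number of
    documents that contain the term.
    """
    n_docs = len(docs)
    # One counting hash table per document (term frequencies).
    tf_tables = [Counter(doc.lower().split()) for doc in docs]
    # One counting hash table across the corpus (document frequencies).
    df = Counter()
    for table in tf_tables:
        df.update(table.keys())
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in table.items()}
        for table in tf_tables
    ]

# Hypothetical toy corpus.
corpus = ["the macintosh is a computer",
          "the cat sat on the mat",
          "the dog chased the cat"]
scores = tf_idf(corpus)
```

In this toy corpus the word "the" occurs in every document and therefore scores zero, while "macintosh" occurs in a single document and receives a positive score.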

Hash tables are an enormous challenge for the flash 
drive because they are naturally based on random hash 
functions and exhibit poor access locality. Thus, the key 
property of the hash table, randomness, becomes a liabil- 
ity on the flash drive. This paper will provide an effective 
method for updates to the tables in flash memory, by us- 
ing a carefully designed scheme which uses two closely 
related hash functions in order to ensure locality of the 
update operations. Specifically, we will be designing a 
counting hash table. Counting hash tables pose an ad- 
ditional challenge since, unlike standard hash tables, a 
duplicate key-value pair requires an in-place update to 
the specific location. In-place updates are non-trivial and 
given their unpredictable nature, they place an additional 
burden beyond just insertions and lookups. 

This paper is organized as follows. The remainder of 
this section will discuss the properties of the flash drive
which are relevant to the effective design of the hash table. 
We will then discuss related work and the contributions 
of this paper. In Section 2 we will discuss our different
techniques. Section 3 contains the experimental results.
The conclusions and summary are contained in Section 4.
1.1 Properties of the Flash Drive

The solid state drive (SSD) is implemented with the use 
of flash memory, which comes in two types: NOR and
NAND chips. NOR chips are faster and have a longer lifetime,
but NAND chips have higher capacity and have been
adopted in most commodity mobile devices using SSDs. 

The most basic unit of access is a page which contains 
between 512 and 4096 bytes, depending upon the man- 
ufacturer. Furthermore, pages are organized into blocks 
each of which may contain between 32 and 128 pages. 
The data is read and written at the level of a page, with the 
additional constraint that when any portion of data is over- 
written at a given block, the entire block must be copied to 
memory, erased on the flash, and then copied back to the 
flash after modification. This process is performed auto- 
matically by the software known as the Flash Translation 
Layer (FTL) on the flash drive. Thus, even a small ran- 
dom update of a single byte could lead to an erase-write 
of the entire block. Similarly, an erase, or clean, can only 
be performed at the block level rather than the byte level. 
Since random writes will eventually require erases once 
the flash device is full, it implies that such writes will re- 
quire block level operations. Thus, the overhead for the 
case of random writes can be very large, unless one is 
careful about the techniques used for data modification. 
On the other hand, sequential writes on the flash are quite 
fast; typically sequential writes are two orders of magni- 
tude faster than random writes. 
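
To make the cost of a small random update concrete, the following sketch computes block sizes from the page and block ranges quoted above. It assumes a deliberately naive model in which every update rewrites its whole block; a real FTL may do better. The function names are hypothetical.

```python
def block_bytes(page_bytes, pages_per_block):
    """Size of one erase block in bytes."""
    return page_bytes * pages_per_block

def rewrite_overhead(update_bytes, page_bytes, pages_per_block):
    """Bytes rewritten per byte changed, assuming the whole block
    is erased and rewritten for every update (naive model)."""
    return block_bytes(page_bytes, pages_per_block) / update_bytes

smallest = block_bytes(512, 32)     # 512-byte pages, 32 pages per block
largest = block_bytes(4096, 128)    # 4096-byte pages, 128 pages per block
worst_case = rewrite_overhead(1, 4096, 128)  # single-byte update
```

Under these assumptions a block spans between 16 KiB and 512 KiB, so a one-byte update to the largest block rewrites 524,288 bytes.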

Another technological limitation of the flash drive is 
that it can typically support only a limited number of 
erase-writes. After this, the blocks on the flash may de- 
grade and they may not be able to support further writes. 
Typically, a flash drive can support between 10,000 and
100,000 erase-writes. In this respect, random writes are
extremely degrading to the flash because they may trigger
many erase-writes. Therefore, it is essential to batch as
many updates as possible on blocks. This is particularly 
difficult for the case of the hash table because it often con- 
tains a large fraction of cells which are not updated and 
the writes are inherently random in nature. 

1.2 Related Work 

Flash devices have recently found increasing interest in 
the database community because of their fast random read
and sequential write performance. Flash devices have
been used in enterprise database applications E3l, as a
write cache to improve latency Ifl3l, and
as an intermediate structure to improve the efficiency of 
migrate operations in the storage layer [21]. Methods 
for page-differential logging for efficiently storing data on 
flash devices in a DBMS independent way are discussed 
in [20]. Other database applications such as the design
of dynamic self-tuning databases and the maintenance of 
database samples have been discussed in [28] [29]. There
has also been work on designing tree-indexes on raw flash
devices ifTTl [35] and indexes to deal with the random-write
problem [25].

Rosenblum and Ousterhout proposed the notion of log-structured
disk storage management [31] that relies on the
assumption that the reads are cheap (as they are served 
from memory) and the writes are expensive (due to disk 
seeks and rotations). Not surprisingly, mechanisms sim- 
ilar to log-structured file systems are adopted in modern 
SSDs either at the level of FTL or at the level of file sys- 
tem to handle issues related to wear-leveling and erase- 
before-write J^ElQjBCESlGIl- As we discuss later, some 
of our buffering strategies are also inspired from log- 
structured file systems. Our design exploits the strength 
of flash-based storage devices in fast sequential writes, 
and tries to alleviate the problem of random writes. 

Hash tables have been designed for SSDs, including
the work presented in [4] in the context of data
intensive networked systems, work in the context of wimpy
nodes, [10] in the context of data de-duplication, [36] for
energy efficient memory sensors, and IfTTl for persistent storage
as write and/or read caches. The work in [26] designs a tree
index for flash devices. However, [26] does not address
duplicate keys and thus cannot handle a counting hash table.

In this work, we design a counting hash table that maintains
frequencies, which has not been addressed thus
far. We store the hash table on the SSD, which is not
seen in the designs of lIBI [TTI. Unlike most of the existing
strategies that rely on simple memory-based buffering
schemes, we design a novel combination of memory- and
disk-based buffering schemes. Our method leverages the
strengths of SSDs (fast sequential/random reads, fast sequential
writes) to effectively address the weaknesses of
SSDs (random writes, write endurance). We would like
to emphasize that the works presented in this section do
not handle a counting hash table, which is required by algorithms
like TF-IDF, whereas our proposed hash table designs
do.

1.3 Contributions of this paper 

In this paper, we will design a hash table for the SSD. The 
hash table is a particularly challenging case compared to 
the case of conventional index structures because it is in- 
herently dependent upon randomness as opposed to spa- 
tial locality. Index structures, which are dependent upon 
spatial locality, are much easier to update because the spa- 
tial locality can be leveraged to perform block updates of 
particular regions of the index. This is non-trivial for the
case of the hash table, in which the randomness guarantees
that successive updates may occur at completely unrelated
places in the table. As a result, it is much more difficult
to cluster updates for the purpose of block updates
in a hash table. In this work,
we make the following specific contributions: (i) We
propose a mechanism to support large counting hash tables
on SSDs via a two-level hash function, which ensures
that the random update problem of flash devices is effectively
handled by localizing the updates to the SSD; (ii) We
devise a novel combination of memory- and disk-based
buffering schemes that effectively addresses the problems
posed by SSDs (random writes, write endurance). While
the memory-resident buffer leverages the fast random accesses
to RAM, the disk-resident buffer exploits the fast read
performance and fast sequential/semi-random write performance
of SSDs; (iii) We perform a detailed empirical
evaluation to illustrate the effectiveness of our approach
by demonstrating the TF-IDF algorithm using our hash
table.

2 A Flash-Friendly Hash Table 

In this section, we will introduce a hash table which is 
optimized for flash storage devices. We will introduce a 
number of different schemes for implementing the hash
table as well as basic hash table operations on these designs.

The major property of a hash table is that its effectiveness
is highly dependent upon updates which are distributed
randomly across the table. On the other hand,
in the context of a flash device, it is precisely this randomness
which causes random access to different blocks
of the SSD. Furthermore, updates which are distributed 
randomly over the hash table are extremely degrading in 
terms of the wear properties of the underlying disk. This 
makes hashing particularly challenging for the case of 
flash devices. 

Hash table addressing is of two types: open and closed, 
depending upon how the data is organized and collisions 
are resolved. These two kinds of tables are as follows. 

• Open Hash Table: In an open hash table, each slot 
of the hash table corresponds to multiple data entries. 
Thus, each slot is a container of entries which map 
onto that value of the hash function. Each entry of 
the collection is a key and frequency pair. 

• Closed Hash Table: In a closed hash table, the en- 
tries are accommodated within the hash table itself. 
Thus, the hash table slot contains the hashed string 
and its frequency. However, since multiple objects 
cannot be mapped onto the same entry, we need a 
collision resolution process, when a hashed object 
maps onto an entry which has already been filled. In 
such a case, a common strategy is to use linear prob- 
ing in which we cycle through successive entries of 
the hash table until we either find an instance of the 
object itself (and increase its frequency), or we find 
an empty slot in which we insert the new entry. We 
note that a fraction of the hash table (typically at least
a quarter) needs to be empty in order to ensure that
the probing process is not a bottleneck. The fraction
of the hash table which is full is denoted by the load factor
$f$. It can be shown that $1/(1 - f)$ entries of the
hash table are accessed on average in order to
access or update an entry.
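
The closed variant described above can be sketched as follows. This is a minimal in-memory illustration, not the SSD-resident structure developed later; the class name `ClosedCountingTable` is hypothetical, and Python's built-in `hash` stands in for the hash function.

```python
class ClosedCountingTable:
    """Closed counting hash table with linear probing (in-memory sketch)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity   # each slot: None or [key, count]
        self.used = 0

    def add(self, key, delta=1):
        """Increase the count of key, inserting it if absent.

        Returns the number of slots probed.
        """
        n = len(self.slots)
        i = hash(key) % n
        for probes in range(1, n + 1):
            slot = self.slots[i]
            if slot is None:             # empty slot: insert a new entry
                self.slots[i] = [key, delta]
                self.used += 1
                return probes
            if slot[0] == key:           # found: update the count in place
                slot[1] += delta
                return probes
            i = (i + 1) % n              # linear probing: try the next slot
        raise RuntimeError("table is full")

    def count(self, key):
        """Return the stored count of key, or 0 if it is absent."""
        n = len(self.slots)
        i = hash(key) % n
        for _ in range(n):
            slot = self.slots[i]
            if slot is None:             # empty slot ends the probe sequence
                return 0
            if slot[0] == key:
                return slot[1]
            i = (i + 1) % n
        return 0

    def load_factor(self):
        return self.used / len(self.slots)

table = ClosedCountingTable(8)
for word in "a b a c a b".split():
    table.add(word)
```

After these six insertions the table holds three distinct keys in eight slots, so its load factor is 3/8.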

In this paper, we will use a combination of the open and 
closed hash tables in order to design our update structure. 
We will use a closed hash table as the primary hash ta- 
ble which is stored on the (Solid State) drive, along with 
an update hash table which is open and available in main 
memory. The hash functions of the two tables are different
(since the number of entries in the secondary hash
table is much lower than in the primary), but they are related
to one another in a careful way, so as to guarantee locality
of updates. We will discuss this slightly later. The
secondary hash table is updated for each incoming record;
from time to time, portions of the secondary table are used 
in order to update the primary table in batch mode. The 
batch-updates are scheduled in a way so as to minimize 
the wasteful erase-writes in the update process. 

We assume that the primary hash table contains q 
entries, where q is dictated by the maximum capacity 
planned for the hash table for the application at hand. 
The secondary hash table contains $\lceil q/r \rceil$ entries, where
$r \ll q$. The hash functions for the primary and secondary
hash tables are denoted by $g(x)$ and $s(x)$, and are defined
as follows:

$g(x) = (a \cdot x + b) \bmod q$ (1)

$s(x) = ((a \cdot x + b) \bmod q) \operatorname{div} r$ (2)

In general, the scheme will work with any pair of hash 
functions g(x) and s(x) which satisfy the following rela- 
tionship: 

$s(x) = g(x) \operatorname{div} r$ (3)

It is easy to see that the entries which are pointed to by 
a single slot of the memory-resident table are located ap- 
proximately contiguously on the drive-resident (closed) 
table, because of the way in which the linear probing pro- 
cess works. This is an important observation, and will be 
used at several places in ensuring the efficiency of the ap- 
proach. Linear probing essentially assumes that items that 
collide onto the same hash function value will be contigu- 
ously located in a hash table with no empty slots between 
them. Specifically, the $m$th slot on the secondary table
corresponds to entries $r \cdot (m - 1) + 1$ up to
entry $r \cdot m$ in the primary table. We note that most entries
which would be pointed to by the mth slot of the sec- 
ondary table would also map onto the afore-mentioned en- 
tries in the primary table, though this would not always be 
true because of the overflow behavior of the linear prob- 
ing process beyond these boundaries. 
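
The relationship between the two hash functions can be checked with a short sketch. The constants `a`, `b`, `q` and `r` below are arbitrary illustrative values, and the helper `make_hashes` is hypothetical. Note that the code uses 0-based indices, so secondary slot `m` covers primary entries `r*m` through `r*m + r - 1`, whereas the text above counts from 1.

```python
def make_hashes(a, b, q, r):
    """Build the primary hash g and the secondary hash s."""
    def g(x):
        return (a * x + b) % q    # Equation (1): primary slot in [0, q)

    def s(x):
        return g(x) // r          # Equation (3): s(x) = g(x) div r

    return g, s

# Illustrative constants only (assumptions, not values from the paper).
a, b, q, r = 31, 7, 1000, 50
g, s = make_hashes(a, b, q, r)

# Every key's home position in the primary table falls inside the
# contiguous range owned by its secondary slot.
locality_holds = all(r * s(x) <= g(x) < r * (s(x) + 1) for x in range(5000))
num_secondary_slots = -(-q // r)  # ceil(q / r)
```

Because all keys that share a secondary slot have home positions within one range of `r` consecutive primary entries, a batch of updates taken from one slot touches one contiguous region of the drive, apart from probing overflow.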

2.1 Desirable Update Properties of an SSD-based Hash Table

In this subsection, we will provide a broad overview of 
some of the desirable update properties of a hash-table. 
In later subsections, we will discuss how these goals are
achieved. A naive implementation of a hash table will
immediately issue update requests to the hash table as the 
data points are received. The vast majority of the write 
operations will be random page level writes due to the 
lack of locality, which is inherent in hash function design. 
As mentioned before, such operations also increase the
number of cleans and the cost of random writes.

A desirable property for a hash table would be block- 
level updates and semi-random writes. The block-level 
update refers to the case when there are multiple updates 
written to a block, and they are all accomplished at one 
time. If there are k updates written to a block, we should 
combine them into one block-level write operation. This 
can reduce the number of cleans from k to one. The semi- 
random writes refer to the fact that the updates to a par- 
ticular block are in the same order as they are arranged on 
the block, even though updates to different blocks may be 
interleaved with one another. 
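
The saving from block-level updates can be illustrated by counting cleans. The sketch assumes the worst case for the naive scheme, namely that every individual update triggers a clean of its block; the update list and function names are hypothetical.

```python
from collections import defaultdict

def cleans_naive(updates):
    """One block clean per update (assumed worst case)."""
    return len(updates)

def cleans_batched(updates):
    """One clean per distinct block when updates are grouped by block."""
    by_block = defaultdict(list)
    for block_id, page_id in updates:
        by_block[block_id].append(page_id)
    return len(by_block)

# Hypothetical stream of (block id, page id) updates.
updates = [(7, 1), (2, 5), (7, 3), (7, 4), (2, 6), (9, 0)]
naive = cleans_naive(updates)
batched = cleans_batched(updates)
```

Here six individual updates fall on three blocks, so batching reduces six cleans to three.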

We give an example of semi-random writes. Consider
the (block id, page id) sequence (3,2), (2,1),
(4,3), (2,4), (2,5), (3,4), (4,6), (3,5), (3,7). This sequence is
considered semi-random, because the page updates to a
particular block are arranged sequentially by their order
of page id. Recall that sequential write patterns improve
latency [28]. This is because of how the flash translation
layer works. The existing methods in the flash translation 
layer are typically lazy; when the ith logical page of a 
block is written, the FTL copies and writes only the first i 
pages to a newly allocated block, instead of all the pages. 
Later, when page j > i is modified, only the pages (i + 1) 
to j are moved and written to the new block. Note that 
this would not be possible if j were less than i. Semi-random
writes therefore improve the write latency of an SSD,
because SSD write performance improves under a sequential
pattern of writes [28]. Thus, the sequential ordering
is useful in minimizing the unnecessary copying of pages 
from old blocks to new blocks. 
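
The semi-random property can be stated as a simple check: within each block, page ids must appear in increasing order, while writes to different blocks may interleave freely. The sketch below assumes strictly increasing page ids per block; the function name is hypothetical.

```python
def is_semi_random(writes):
    """True if, within each block, page ids appear in increasing order."""
    last_page = {}
    for block_id, page_id in writes:
        if block_id in last_page and page_id <= last_page[block_id]:
            return False            # a page at or before the last one written
        last_page[block_id] = page_id
    return True

# The sequence from the text above.
example = [(3, 2), (2, 1), (4, 3), (2, 4), (2, 5),
           (3, 4), (4, 6), (3, 5), (3, 7)]
ok = is_semi_random(example)
not_ok = is_semi_random([(2, 4), (2, 1)])   # page 1 written after page 4
```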

2.2 Hash table designs 

A variety of low level structures can be maintained in or- 
der to accomplish the desirable properties discussed ear- 
lier. We will design a number of such hash table mainte- 
nance schemes. All of these schemes use a combination 
of these low level structures in different ways. However, 
we would like to introduce these low level structures at 
this stage in order to ease further discussion. Recall that
we combine an open hash table in main memory with a 
closed-hash table on the SSD. This open (or secondary) 
hash table is typically implemented in the form of a RAM 
buffer denoted as Hr. The RAM buffer will contain updates
for each block of the SSD and execute batch updates
to the primary hash table on disk, or data segment
(denoted by Hd), at the block level. This approach can
reduce block level cleaning operations. 

2.3 Memory Bounded Buffering 

The common structure of the
hash table architectures presented in this paper is illustrated
in Figure 1. We refer to this scheme as Memory
Bounded Buffering or MB. The RAM buffer in the diagram
is an open (or secondary) hash table and the data
segment is a closed (or primary) hash table. There are 
s slots, each of which corresponds to a block in the data 
segment. The maximum capacity of the data segment is q 
pages, r pages per block and g entries per page. Thus, the 
number of slots in the secondary hash table, s, must be 
equal to q/r. Updates are flushed onto the SSD one block 
at a time. Because of the relationships between the hash 
functions of the primary and secondary table, the merge 
process of a given list requires access to only a particular 
set of SSD blocks which can be maintained in main mem- 
ory during the merging process. This may of course in- 
volve the insertion of new items that are not present in the 
data segment and items that collide with entries already 
inside of the data segment. 

2.4 Memory and Disk Bounded Buffering 

Since Hr is main-memory resident, it is typically re- 
stricted in size. Therefore, a second buffer can be imple- 
mented on the SSD itself. This new segment is referred 
to as the change segment or Se. The change segment
acts as a second level buffer. When Hr exceeds its size 
limitations, the contents are sequentially written to the 
change segment at the page level starting from the first 
available page in an operation known as staging. When
the change segment is full, it is merged with the data segment,
and staging restarts from the top of the change segment. A page
in the change segment may contain updates from multiple
blocks because pages are packed with up to g entries
irrespective of their slot origin. Thus, the change
segment is organized as a log structure that contains the 
flushed updates of the RAM buffer. This takes advantage 
of the semi-sequential write performance of the SSD and 
increases the lifetime of the SSD. The space allocated to 
the change segment is in addition to the space allocated to 
the data segment. This hash table (with change segment
included) is illustrated in Figure 2. It is important to note
that a stage() operation differs from a merge() operation in
two ways. Specifically, stages write at the page level while
merges operate at the block level. Furthermore, stages in- 
volve updates to the change segment while merges involve 
updates to the data segment. 

There are two types of architectures for the change seg- 
ment. In the first design, the change segment Se is viewed 
as a collection of blocks where each block holds updates 
from multiple lists from Hr. In other words, multiple 
blocks in the data segment are mapped to a single block 
in the change segment. We arrange the change segment 
in a way such that each change segment block holds the 
updates for k data segment blocks. The value of k is con- 
stant for a particular instantiation of the hash-table, and 
can be determined in an application-specific way. For an 
update-intensive application, it is advisable to set k to a 
smaller value at the expense of SSD space. 
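
The space trade-off governed by k can be sketched as follows. The text does not specify which data segment blocks share a change segment block, so the mapping below (k consecutive data blocks per change block) is an assumption, and both function names are hypothetical.

```python
import math

def change_block_of(data_block, k):
    """Assumed mapping: k consecutive data blocks share one change block."""
    return data_block // k

def change_segment_blocks(num_data_blocks, k):
    """Number of change segment blocks needed under that mapping."""
    return math.ceil(num_data_blocks / k)

blocks_k4 = change_segment_blocks(10, 4)   # larger k: less SSD space
blocks_k2 = change_segment_blocks(10, 2)   # smaller k: more SSD space
```

With ten data segment blocks, k = 4 needs three change segment blocks while k = 2 needs five, which reflects the cost in SSD space of choosing a smaller k.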

When a particular change segment block is full, we 
merge the information in the change segment to the data 
segment blocks. The advantage of this method is best 
demonstrated if the RAM buffer is small. In that case, 
it will cause frequent merges onto the data segment un- 
der the MB design of the hash table. By adding the 
change segment, we are providing a more efficient buffer- 
ing mechanism. Staging a segment is more efficient than 
merging it because the change segment is written onto 
the SSD with a straightforward sequential write, which 
is known to be efficient for SSDs. This approach is called
Memory Disk Buffering or MDB. 

In the second variation of the MDB scheme (which we henceforth
refer to as MDB-L, for MDB-Linear), the space
allocated for the change segment is viewed as a single
large monolithic chunk of memory without any subdivi- 
sions. This view resembles a large log file. Thus, the 
change segment blocks are not assigned to k data seg- 
ment blocks. The writes to the change segment are ex- 
ecuted in FCFS fashion. This type of structure mimics 
a log-structured file system and fully takes advantage of 
the SSD strength in sequential writes. We maintain a
collection of pointers to identify the ranges, measured in
pages, to which a particular slot in the RAM buffer has been
staged. These pointers are similar to the indexing information
[31] maintained in log-structured file systems that
helps in reading the files from the log efficiently.

A merge operation is triggered when the change segment
is full. The collection of pointers can be used to
identify the pages to which a particular block was staged.
This process produces random reads on the change segment
because the ranges span multiple stage points. The reads are
also repetitive because a page may contain entries from 
multiple blocks because of the staging process. During a 
stage, entries from multiple blocks may be packed into a 
single page. During a merge, each page will be requested 
by each data segment block that has entries staged onto it. 
After all of the pages for a particular data segment block 
are read from the change segment, the entries are merged 
with the corresponding data segment block. 
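
The staging and merging behaviour of MDB-L can be sketched with an in-memory stand-in for the change segment. This is a simplified model under stated assumptions: pages are Python lists rather than SSD pages, the pointers are kept as a set of page numbers per slot rather than as ranges, and the class and method names are hypothetical.

```python
from collections import defaultdict

class LinearChangeSegment:
    """Log-style change segment: pages are appended in arrival (FCFS) order."""

    def __init__(self, entries_per_page, capacity_pages):
        self.g = entries_per_page
        self.capacity = capacity_pages
        self.pages = []                     # each page: list of (slot, key, delta)
        self.pointers = defaultdict(set)    # slot -> page numbers holding its entries

    def stage(self, buffered):
        """Pack buffered (slot, key, delta) entries into pages, g per page."""
        for start in range(0, len(buffered), self.g):
            page = buffered[start:start + self.g]
            page_no = len(self.pages)
            self.pages.append(page)
            for slot, _, _ in page:
                self.pointers[slot].add(page_no)

    def is_full(self):
        return len(self.pages) >= self.capacity

    def merge(self):
        """Return ({slot: {key: total delta}}, page reads) and reset the log."""
        reads = 0
        merged = {}
        for slot in sorted(self.pointers):
            totals = defaultdict(int)
            for page_no in sorted(self.pointers[slot]):
                reads += 1                  # a shared page is read once per slot
                for s, key, delta in self.pages[page_no]:
                    if s == slot:
                        totals[key] += delta
            merged[slot] = dict(totals)
        self.pages.clear()
        self.pointers.clear()
        return merged, reads

seg = LinearChangeSegment(entries_per_page=2, capacity_pages=4)
seg.stage([(0, "a", 1), (1, "b", 1), (0, "c", 2)])   # fills pages 0 and 1
seg.stage([(1, "b", 3)])                             # fills page 2
merged, reads = seg.merge()
```

In this run three pages are staged but four page reads are issued during the merge, because page 0 holds entries for both slots and is requested once by each. This is the repetitive read behaviour described above.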

Figure 1: Hash Table with RAM buffer
[Diagram labels: RAM Buffer (open hash table), Data Segment (closed hash table), Block Level Updates]

Figure 2: Hash Table with RAM buffer and Change Segment



2.5 Element Insertion and Update Process 

The element insertion process is designed to perform individual
updates on the memory-resident table only, since
this can be done in an efficient way. Such changes are 
later rolled on to the RAM buffer (which is in turn rolled 
on to the change segment for some of the schemes). 

For each incoming record x, we first apply the hash 
function s(x) in order to determine the slot to which the 
corresponding entry belongs. We then determine if the 
key x is present inside the corresponding slot s(x). If 
the element is found, then we increase its frequency. The 
second case is when the key x is not contained inside the 
buffer which is pointed to by the slot s(x). In such a case, 
we add the key x as a new element to the RAM buffer. The
size of the RAM buffer increases in this case. If the RAM 
buffer has grown too large, it is flushed either onto
the change segment or directly onto the data segment, depending upon
whether or not the change segment is implemented in the 
corresponding scheme. Because of the relationship be- 
tween the hash functions of the RAM buffer and the SSD 
based hash table, such an update process tends to preserve 
the locality of the update process, and if desired, can also 
be made to preserve semi-random write properties. 
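
The insertion steps above can be sketched as follows. This is an illustrative model only: the `flushed` list stands in for whichever SSD structure receives the batches, the size limit is counted in distinct keys, and the class name, constants and threshold are assumptions rather than details from the paper.

```python
from collections import Counter

class RamBuffer:
    """Open (secondary) hash table: one Counter of pending updates per slot."""

    def __init__(self, q, r, a=31, b=7, max_entries=4):
        self.q, self.r, self.a, self.b = q, r, a, b
        self.slots = [Counter() for _ in range(-(-q // r))]  # ceil(q / r) slots
        self.max_entries = max_entries
        self.entries = 0
        self.flushed = []       # stand-in for the SSD: list of (slot, updates)

    def g(self, x):
        return (self.a * x + self.b) % self.q

    def s(self, x):
        return self.g(x) // self.r

    def insert(self, x, delta=1):
        slot = self.slots[self.s(x)]
        if x not in slot:
            self.entries += 1   # new key: the buffer grows
        slot[x] += delta        # existing key: only the count changes
        if self.entries > self.max_entries:
            self.flush()

    def flush(self):
        """Hand each non-empty slot over as one batch per SSD block."""
        for m, slot in enumerate(self.slots):
            if slot:
                self.flushed.append((m, dict(slot)))
                slot.clear()
        self.entries = 0

buf = RamBuffer(q=100, r=10, max_entries=3)
for x in [1, 2, 1, 3, 1, 4]:
    buf.insert(x)
```

The fourth distinct key exceeds the limit and triggers a flush. Keys 1 and 4 share secondary slot 3, so their updates leave the buffer together as a single batch for one block.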

During the insertion process of new items, linear prob- 
ing may occur because Hd is a closed hash table. If the 
linear probing process reaches the end of the current SSD 
block, then we do not move the probe onto the next block. 
Rather, an overflow region is allocated within the SSD ta- 
ble which takes care of probing overflows beyond block 
boundaries. The last index of the last page of an SSD 
block becomes a pointer assigned to the overflow region. 
The entry that was resident at this position now resides 
in the overflow region alongside the newly inserted entry.
Thus, the data segment is a collection of blocks with log- 
ical extensions. The overflow region, a collection of SSD 
blocks, is allocated when the hash table is created. Its size 
can vary depending on user specifications. The blocks 
are written one at a time and the pages are assigned when 
needed. When an overflow region is needed, a page is assigned.
If another region is needed, another page is allocated
and the previous page points to the newly allocated
page. When a block is full, another block is used. An 
overflow block may contain entries from multiple blocks 
in the SSD data segment Hd because the regions are al- 
located a page at a time when they are needed. 
2.6 Implementing Deletion Operations

It is also possible to implement deletion operations in the 
hash table. There are two kinds of deletion possible: 

• A deletion operation reduces the frequency of an 
item by 1. This case is trivially solvable by using 
the approach for insertion, except that we use a fre- 
quency of -1 for the incoming element. Entries with 
frequency value of 0 are not retained in the memory-resident
table, and entries with negative frequencies
are allowed. These negative frequencies are appro- 
priately transferred to the drive-resident table during 
batch updates. 

• A deletion operation completely removes an element 
from the hash table irrespective of its frequency. This 
case is more complex, and is discussed below. 
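
The first kind of deletion can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluated implementation: the class name CountingBuffer and its member functions are hypothetical, and the drive-resident table is modeled as an ordinary in-memory map.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using Counts = std::unordered_map<std::uint64_t, std::int64_t>;

// Memory-resident counting buffer: a deletion is an insertion with frequency -1.
class CountingBuffer {
public:
    void add(std::uint64_t key, std::int64_t delta) {
        auto it = pending_.find(key);
        if (it == pending_.end()) {
            if (delta != 0) pending_.emplace(key, delta);
            return;
        }
        it->second += delta;
        if (it->second == 0) pending_.erase(it);  // zero entries are not retained
    }
    void insert_one(std::uint64_t key) { add(key, +1); }
    void remove_one(std::uint64_t key) { add(key, -1); }

    // Pending delta for key; it may be negative.
    std::int64_t pending(std::uint64_t key) const {
        auto it = pending_.find(key);
        return it == pending_.end() ? 0 : it->second;
    }
    std::size_t size() const { return pending_.size(); }

    // Batch update: transfers all pending deltas, including negative ones,
    // to the drive-resident table (modeled here as an in-memory map).
    void flush(Counts& drive_table) {
        for (const auto& kv : pending_) {
            assert(kv.second != 0);  // invariant maintained by add()
            drive_table[kv.first] += kv.second;
        }
        pending_.clear();
    }

private:
    Counts pending_;
};
```

An insertion followed by a deletion of the same key leaves no trace in the buffer, whereas a deletion of a key that only exists on the drive is carried as a negative delta until the next batch update.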

If an item is deleted from the data segment, it can ei- 
ther be removed or its frequency can be set to zero. If it is 
removed, there will be added complexity to queries,
updates, and inserts. This is because linear probing
assumes that the entries along a probe sequence occupy
contiguous slots. During a query, if an empty slot is
encountered during the probing process, the query
terminates (with the guarantee that the item is not found
anywhere else in the hash table) because of this
contiguity assumption. However, the removal of entries
in a deletion process can violate this contiguity assump- 
tion. This is because the empty slot encountered during 
linear probing may be the result of a previously filled slot, 
which was removed by a deletion process. In such a case, 
the desired entry may reside beyond the empty slot, and 
it may no longer be possible to terminate after the first in- 
stance of an empty slot. This could potentially invalidate 
the correctness of the query processing. 

This problem can be handled either during the merge of a
block or periodically. In both cases, the block is loaded
into main memory. The entries are rehashed into a main
memory block, and the deleted items are ignored during
this process. This ensures that the remaining entries are
contiguous. The newly hashed entries are then written
directly back to the data segment block.
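
The contiguity problem and its repair can be sketched for a single block held in main memory. The sketch makes several assumptions that are not taken from the text: deleted entries are marked with a hypothetical Deleted state, the hash function is a plain mod, and probing wraps around inside the block instead of spilling into an overflow region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class State { Empty, Live, Deleted };

struct Slot {
    std::uint64_t key = 0;
    std::int64_t freq = 0;
    State state = State::Empty;
};

// Lookup with linear probing that stops at the first empty slot.
bool contains(const std::vector<Slot>& block, std::uint64_t key) {
    assert(!block.empty());
    std::size_t i = key % block.size();
    for (std::size_t n = 0; n < block.size(); ++n, i = (i + 1) % block.size()) {
        if (block[i].state == State::Empty) return false;  // contiguity assumption
        if (block[i].state == State::Live && block[i].key == key) return true;
    }
    return false;
}

// Rebuilds a block in memory: live entries are rehashed into a fresh block and
// deleted entries are skipped, so the surviving probe sequences are contiguous.
std::vector<Slot> rehash_block(const std::vector<Slot>& old_block) {
    std::vector<Slot> fresh(old_block.size());
    for (const Slot& s : old_block) {
        if (s.state != State::Live) continue;
        std::size_t i = s.key % fresh.size();
        while (fresh[i].state == State::Live) i = (i + 1) % fresh.size();
        fresh[i] = s;
    }
    return fresh;
}
```

If a slot in the middle of a probe sequence is simply emptied, contains() stops early and misses entries stored beyond it; after rehash_block() the remaining entries are found again and the deleted slots have been reclaimed.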

2.7 Query Processing 

In the simple hash table, queries are fulfilled by an I/O
request to the data segment. However, in our proposed
designs the corresponding entry may also be found in the
change segment or the RAM buffer. Therefore, the query
processing approach must search the change segment and
the RAM buffer in addition to the data segment. Thus, the
frequency of a queried item is the total frequency found in
the change segment, RAM buffer, and data segment. The
search of the RAM buffer is inexpensive because
it is in main memory. On the other hand, access to the
change segment requires access to the SSD.

For the case of the data segment, the query processing
approach is quite similar to that of standard hash tables.
A hash function is applied to the queried entry in order
to determine its page level location inside the data
segment. If the entry is found there, its frequency is
returned. If it is not found, linear probing begins because
the disk hash table is a closed hash table. Linear probing
halts when the entry is found or an empty slot is
encountered. Query processing of the change segment
also requires locating the entry, which may reside in
multiple places in the change segment due to repeated
flushing of the RAM buffer.
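
The way a query consolidates counts across the three tiers can be sketched as follows. The class TieredCounts, its max_batches parameter and the use of in-memory maps for the SSD-resident tiers are assumptions made for illustration; the sketch only shows why every staged copy of an entry must be visited and says nothing about the on-device layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Counts = std::unordered_map<std::uint64_t, std::int64_t>;

class TieredCounts {
public:
    // max_batches: number of staged batches the change segment holds before a merge.
    explicit TieredCounts(std::size_t max_batches) : max_batches_(max_batches) {
        assert(max_batches > 0);
    }

    void add(std::uint64_t key, std::int64_t delta) { ram_[key] += delta; }

    // Stage: flush the RAM buffer into the change segment as one more batch.
    void stage() {
        if (ram_.empty()) return;
        change_.push_back(std::move(ram_));
        ram_.clear();
        if (change_.size() == max_batches_) merge();
    }

    // Merge: fold the whole change segment into the data segment.
    void merge() {
        for (const Counts& batch : change_)
            for (const auto& kv : batch) data_[kv.first] += kv.second;
        change_.clear();
    }

    // A query sums the RAM buffer, every staged batch, and the data segment.
    std::int64_t query(std::uint64_t key) const {
        std::int64_t total = lookup(ram_, key) + lookup(data_, key);
        for (const Counts& batch : change_) total += lookup(batch, key);
        return total;
    }

    std::size_t staged_batches() const { return change_.size(); }

private:
    static std::int64_t lookup(const Counts& tier, std::uint64_t key) {
        auto it = tier.find(key);
        return it == tier.end() ? 0 : it->second;
    }

    std::size_t max_batches_;
    Counts ram_;
    std::vector<Counts> change_;
    Counts data_;
};
```

The answer to a query is the same before and after a merge; only the number of places that have to be read changes, which is the effect the query time experiments measure.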

Recall that MDB partitions the change segment. When 
a RAM bucket is staged, it is always written to the same 
change segment block. We locate the appropriate change 
segment block and bring it into memory to be searched. 
In MDB-L, RAM buckets can reside on multiple pages, 
and thus we must issue random page reads. We expect 
MDB-L to be faster because of page level access. 

3 Experiments 

In this section we present an empirical analysis of the hash 
table designs discussed in the previous sections. We eval- 
uate the performance of the three main schemes discussed 
in this article, namely MB, MDB and MDB-L. Broadly, 
our objectives are to understand the I/O overheads of var- 
ious schemes and their query performance. Additionally, 
since SSD disks permit a limited number of clean opera- 
tions, it is also important to quantify the wear rate of the 
devices. We begin with a discussion of the experimental 
setup. 

3.1 Experimental Setup 

To evaluate our hash table configurations, we used
the DiskSim simulation environment [7], managed by
Carnegie Mellon University, and the SSD Extension for
this environment created by Microsoft Research [1, 27]. We
operated DiskSim in slave mode, which allows programmers
to incorporate DiskSim into another program for increased
timing accuracy.

We ran our experiments on three different configurations
of the latest representative NAND flash SSDs from
Intel (see Table 1 for details). Two of these SSDs are
MLC (Multi-Level Cell) based and the third is SLC
(Single-Level Cell) based. We chose both MLC and SLC
devices because of their differing characteristics. MLC
provides much higher data density and lower cost
(which makes it more popular), but it has a shorter
lifespan and slower read/write performance. SLC, on the
other hand, has faster read/write performance and a
significantly longer lifespan. SLC also has a lower
internal error rate, making it preferable for
high-performance, high-reliability devices [14].

All hash table experiments involve inserting, deleting,
and updating key-value pairs. The size of the RAM buffer
is parameterized on the size of the data segment and
expressed as a percentage. The rationale here is that an
end application may need to create multiple
hash tables on the same SSD. Moreover, the
characteristics of access may vary across applications
(i.e., one may want different RAM buffer sizes for each
hash table). The change segment is likewise
parameterized, and the overflow segment for all
experiments was set to a minimal value (one block) since
this was found to be sufficient. Key-value pairs are
integer pairs.

We conducted our experiments on a DELL Precision
T1500 with an Intel Core i7 860 CPU at 2.8 GHz,
8 GB of memory, and 8 cores, running Ubuntu 10.04.
Our code was implemented in C++. The Hr data
structure utilized the C++ Standard Template Library [33]
for its implementation. The RAM buffer buckets that
correspond to data segment blocks are arranged inside a
C++ vector and their indexes correspond to their placement
on the data segment. For example, the first block inside
the data segment corresponds to the first block in the RAM
buffer. The data segment can be viewed as an array
logically divided into blocks and further divided into
pages.

Table 1: SSD Configurations 



3.2 TF-IDF 

To demonstrate the efficacy of our methods, we
implemented the TF-IDF algorithm (see Section 1) using our
hash table designs. In our experiments, we compute the
TF-IDF score of all words in our corpus. Our hash table
contains the frequencies of each keyword. As we read in
each document, we compute the frequency of each word
and store it in our counting hash table.
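
The role of the counting hash table in this computation can be sketched as follows. The sketch uses std::unordered_map in place of the flash-aware designs and assumes the common weighting tf * log(N / df), which may differ from the variant used in the experiments; all function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Document = std::vector<std::string>;
using Counts = std::unordered_map<std::string, std::size_t>;

// Term frequencies of one document, accumulated in a counting hash table.
Counts term_counts(const Document& doc) {
    Counts counts;
    for (const std::string& word : doc) ++counts[word];
    return counts;
}

// Number of documents in which each term occurs at least once.
Counts document_frequencies(const std::vector<Document>& corpus) {
    Counts df;
    for (const Document& doc : corpus) {
        const std::unordered_set<std::string> seen(doc.begin(), doc.end());
        for (const std::string& word : seen) ++df[word];
    }
    return df;
}

// Hash table lookup that returns 0 for an absent term.
std::size_t count_of(const Counts& counts, const std::string& term) {
    auto it = counts.find(term);
    return it == counts.end() ? 0 : it->second;
}

// Assumed weighting: tf * log(N / df), with a score of 0 for absent terms.
double tf_idf(std::size_t term_count, std::size_t doc_freq, std::size_t num_docs) {
    assert(doc_freq <= num_docs);
    if (term_count == 0 || doc_freq == 0) return 0.0;
    return static_cast<double>(term_count) *
           std::log(static_cast<double>(num_docs) / static_cast<double>(doc_freq));
}
```

Every call to count_of() corresponds to one query against the hash table, and every increment in term_counts() to one insert or update, which is the workload the remaining experiments exercise.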

3.3 Data Sets 

We use two datasets: a Wikipedia dump and a MemeTracker
dump, which are essentially large text files.
Wiki: The first data set we use is a collection of randomly
selected Wikipedia articles. We chose 100,000 random
articles from Wikipedia's publicly available dump¹. These
100,000 articles were approximately 1GB in size. This
dataset contains 136,749,203 tokens (keywords) with
9,773,143 unique entries. For the testing of this data
set, our hash table size was set to 100MB. On a 128 page
per block SSD this amounts to approximately 205 SSD blocks
allocated to the data segment.

To evaluate I/O performance during inserts or updates²,
we simply inserted (or updated) tokens and their
corresponding counts into the hash table. Statistics and
times for various operations (cleans, merges, stages, etc.)
are computed and discussed shortly.

¹ See http://dumps.wikimedia.org/

² As noted earlier, deletes are handled as inserts with a negative count.

To evaluate query performance we first processed 35 
million tokens. Subsequently, roughly 100 million words 
were inserted. Simultaneously with inserts, we also issued 
a million queries interleaved randomly across inserts. A 
query is a hash table lookup. In the TF-IDF context, this 
corresponds to "how frequent is a keyword" which allows 
us to compute the TF-IDF score of a keyword. Of these
queried items, on average (across 10 different random
workloads) 933,139 were present inside the hash table at
query time.

Meme: The second data set we report on is the
MemeTracker dataset³. We downloaded the August 2008
dataset. We found 17,005,975 unique entries and
402,005,270 total entries. Since this dataset is slightly
larger, our hash table size was set to 200MB. On a 128 page
per block SSD this translates to approximately 410 SSD
blocks allocated to the data segment.

I/O performance is evaluated in the same way as for the
Wiki dataset. For query performance, the first 130 million
words were inserted into the hash table. Subsequently, the
remaining 270 million words were interleaved with about one
million queries. Of these queried items, on average (across
10 random workloads) 959,731 were found inside the hash
table.

3.4 Query Time Performance 

In all our graphs, the Y-axis is the average time per query
in milliseconds. Results on both Wiki and Meme are
provided in Figure 3. The main trends we observe are:
i) the query times for MB are quite low (it does not have a
change segment); ii) the query times for MDB are quite high
and do not drop significantly with a reduction in change
segment size; iii) the query times for MDB-L improve
dramatically with a reduction in the change segment size;
and iv) query times for Meme are marginally lower than the
query times for Wiki for both MDB and MDB-L. These
trends can be explained as follows. Query costs for MB
are essentially fixed since a query only has to combine
the counts from the memory buffer (negligible) with,
typically, a page read to access the requisite
information from the data segment. Query costs for MDB
require consolidation of information from the memory
buffer (negligible), from the change segment (expensive,
dominated by block level reads) and from the data
segment (usually a single page read). Query costs for MDB-L
require consolidation across the memory buffer
(negligible), the change segment (typically requiring a few
page reads, which are significantly reduced as the size of
the change segment is reduced) and the data segment
(usually a single page read). This is reflected in our first
experiment, shown in Figure 3a for MLC-1, in which we
varied the change segment size while fixing the RAM buffer
at 5%. With regard to the difference between Wiki and
Meme query times, upon drilling down into the data, we
find that on average there are 11.5% more page reads for
Wiki. This may be an artifact of the linear probing costs
within both datasets, given that the ratio of the
number of unique tokens to hash table size is slightly
higher for Wiki.

³ See http://memetracker.org/

For the second experiment, again on the MLC-1
configuration, we fixed the change segment to 12.5% and
varied the RAM buffer size for both datasets (see Figure 3b).
We observe that with an increase in RAM buffer size: i)
MB shows a negligible change in average query time; ii)
MDB shows a decrease in average query time; iii) MDB-L
shows a significant decrease in average query time;
and iv) query times on Meme are typically lower
than those on Wiki.

To explain these trends we should first note that
increasing the RAM buffer size has the general effect of
reducing the number of stage operations, and thus the
average amount of useful information within the change
segment. Thus the time it takes to consolidate the
information within the change segment in order to answer a
query is, on average, lower for both MDB and MDB-L. For
MDB-L the improvement is more marked because fewer page
reads are required. The explanation for why query times
are lower for Meme is similar to what we observed for
the previous experiment.

The third experiment we performed on query time
performance evaluated the three SSD configurations on the
Wiki dataset, as shown in Figure 3c. Here the RAM buffer
was set to 5% and the change segment was set to 12.5%.
The results are along expected lines in that average query
times are slightly better on MLC-2 and SLC than on MLC-1
for the MDB method. The superior read performance for both
page level and block level operations is the primary
reason. This difference is marked in the case of MDB, but
for both MDB-L and MB the difference is negligible. MDB
requires a block level read for a query, and the
performance difference for this type of operation is more
pronounced for MLC-2 and SLC when compared with MLC-1.

To conclude we should reiterate that the query
performance times we observe here are for our
update-intensive query workload where we interleaved
queries with inserts (averaged over multiple runs). In this
environment the query time performance of MB is always the
fastest. For reasonable parametric settings MDB-L
typically approaches MB in performance while MDB is
always an order of magnitude worse. We should note that we
also evaluated query times for all three methods in more
stable settings (few updates/inserts). In such a stable
setting we found that the query times for all three methods
were identical. The query cost essentially boils down to a
page read or two on the data segment (since the change
segment is empty and does not factor in). Furthermore,
the query time of MDB is bounded by a single block read
while the query time of MDB-L may vary. However, as our
results indicate, the pointer guided page level accesses of
MDB-L provide efficient read access that outperforms MDB.

3.5 I/O Performance 

In this section we examine the I/O performance of the
three strategies. To exclude the impact of queries in this
section, our workloads for both datasets simply insert all
the tokens or words into their respective hash tables. In
our first experiment, we report the overall I/O cost from
the perspective of the SSD for the three SSD configurations
for both Wiki (see Figure 4a) and Meme (see Figure 4b).
The RAM buffer is set to 5% and the change segment is
set to 12.5% in this experiment.

The main trend we observe is that MDB-L and MDB
have comparable I/O costs that are significantly lower
than those of MB. This is primarily attributable to the
presence of the change segment, which enables sequential
(MDB-L) or semi-random (MDB) writes. Additionally, as we
shall see shortly, MB requires a large number of erasures,
which also contribute to the overall I/O cost. Another
trend we observe is that among the SSD configurations SLC
and MLC-2 offer comparable performance, with a slight edge
to SLC. MLC-1 is quite a bit slower. This is primarily
attributable to the superior write bandwidth of SLC and
MLC-2. Finally we observe that the overall I/O times are
higher for Meme than for Wiki (larger dataset and larger
hash table).

Not shown in our reports are the performance measures
for a hash table without the use of a buffer. The
advantage of this scheme is fast query times because
queries are only page level reads on the data segment.
However, our results show that such a hash table would
induce 1,680,323 cleans for the Wiki dataset and 6,669,932
cleans for the Meme dataset. Its I/O times are on the
order of 615 times slower on the Wiki dataset and 1500
times slower on the Meme dataset than the times reported
in Figure 4. This increase is caused by cleaning time
and random page writes. It is clear that there is a benefit
from our designs.

Next we discuss the breakdown of page and block level
operations, and of merge and stage operations. To better
understand the I/O performance, we drill down a bit further
on some of the core operations within the I/O subsystem
for all three methods. Table 2 shows the block and
page level operations, number of merges and number of
stages for each method for varying RAM buffer percentages
and varying change segment sizes on the Wikipedia
dataset. For each method, "Block" represents the number
of block operations and "Page" represents the number
of page level operations. The percentage shown with the
Block value is the ratio of block level operations to the
total number of (Block + Page) operations. Columns
"Merges" and "Stages" list the number of merges and stages
for a method, respectively. Note that MB has no page level
operations and does not use staging.

Before we discuss the main trends we should highlight
that the number of merges differs considerably across the
methods, because a merge operation is not exactly the same
for each method. Recall that a merge for MB essentially
entails unifying the contents of the memory buffer with the
data segment. In fact the number of merge operations for MB
is identical to the number of stage operations for the
other two methods. Staging in both of these methods
involves unifying the contents of the memory buffer with
the change segment via sequential writes or semi-random
writes. A merge operation for MDB-L entails unifying the
entire contents of the change segment with the data
segment. A merge operation for MDB entails merging the
contents of one block within the change segment (that has
filled up) with the contents of the data segment. This
explains the difference in the number of merges for these
methods.

Table 2: Block and Page level Operations on Wiki.

The main trends we observe from this table are noted 
below. 

1. The number of basic I/O operations (Block, Page)
and meta-level I/O operations (Merges, Stages), and
therefore the total I/O cost, drops as we increase the
RAM buffer size, since a larger buffer absorbs more
updates before it has to be flushed.

2. The ratio of block level operations to page level
operations increases both with increasing RAM buffer
size and with decreasing change segment size for
both MDB and MDB-L. The number of stages decreases
faster than the number of merges if the RAM
buffer is increased. If the change segment size is
reduced, the number of merges will increase because
the change segment will fill more frequently.

3. MDB-L typically has the lowest number of block
level operations while MB requires the largest number
of block operations (which are significantly more
expensive than page operations⁴). The low number of block
operations for MDB-L can be attributed to the linear
change segment.

4. In terms of overall I/O costs MDB and MDB-L have
a similar profile while MB is significantly more
expensive.

⁴ MLC-1 is on the order of 30 and 50 times more expensive for block
level reads and block level writes, MLC-2 is over 25 and 35, and SLC is
over 24 and 28 respectively.

Summing up the I/O performance, it is fair to say that
for most reasonable parameter settings MDB and MDB-L
significantly outperform MB in terms of the cost of I/O
from the perspective of the flash device. Additionally it
should be noted that the merging operation within both
MDB and MDB-L will happen completely within the SSD
(allowing for an overlap of CPU operations, which is not
reflected in any of the experiments) whereas a merge for MB
and staging for the other two methods will require some CPU
intervention. Also note that an MB merge operation is
significantly more expensive than an MDB or MDB-L stage
operation (random writes versus semi-random/sequential
writes).

3.6 Cleans 

In our next experiment we take a closer look at the
number of clean operations required by these methods for
both datasets (see Figure 5). The X-axis of our graphs
shows the variation of RAM buffer size for both datasets
and the variation of change segment size for the Wiki
dataset. The Y-axis is the number of erasures. The main
trends and explanations for these trends are: i) the
number of cleans goes down with increasing RAM buffer sizes
since there are fewer stages and merges, as shown in
Figure 5a; ii) the number of cleans is significantly higher
for MB compared to that of the other two methods because
the change segment provides an extra level of buffering for
MDB and MDB-L, as shown in Figure 5b; iii) the number of
cleans increases for MDB and MDB-L as we decrease the size
of the change segment because the change segment fills more
often and thus there are more merges, whereas MB does not
use the change segment and so its number of cleans stays
constant; and iv) the number of cleans for MDB and MDB-L
are very similar, with MDB-L being slightly better. The
reduction for MDB-L is clearly attributable to the linear
change segment design.

4 Conclusions 

Hash tables pose a challenge for flash devices. Updating
an entry inside a disk hash table may trigger the
erasure of an entire SSD block. Repeatedly updating a hash
table can be detrimental to the limited lifetime of the
underlying SSD. A simple hash table without buffering can
be implemented. It has superior query time but it induces
a substantial number of cleans and a high I/O cost. From our
experiments, we believe that an SSD friendly hash table 
will have a RAM buffer and a disk based buffer that sup- 
ports semi-random writes. These features will increase 
the locality of updates and reduce the I/O cost of the hash 
table for both low and high end SSDs. Overall our results 
reveal that when one accounts for both I/O performance 
and query performance, MDB-L seems to offer the best 
of both worlds on the workloads we evaluated for reason- 
able parameter settings of change segment size and RAM 
buffer size. Furthermore, MLC-2 seems to offer the best 
trade-off currently when taking into account both cost and 
performance of the device. In the future, we would also 
like to extend our design to hash functions that do not rely 
on the mod operator (e.g. extendible hashing) and exam- 
ine various checkpointing methods. 